# Vision-Language Models

## GUI-Actor-2B-Qwen2-VL

**License:** MIT · **Tags:** Text-to-Image, Transformers · **Publisher:** microsoft · **Downloads:** 163 · **Likes:** 9

GUI-Actor-2B is a vision-language model based on Qwen2-VL-2B, designed specifically for graphical user interface (GUI) grounding tasks. With an added attention-based action head and fine-tuning, it performs well on multiple GUI grounding benchmarks.
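The entry above credits an attention-based action head for GUI grounding. As a loose illustration of the idea (not GUI-Actor's actual architecture), the hypothetical PyTorch sketch below attends over visual patch tokens and reads off a click point as the attention-weighted average of patch centers; all names, dimensions, and the pooling scheme are invented for the example.

```python
import torch
import torch.nn as nn

class AttentionActionHead(nn.Module):
    """Toy action head: attend over visual patch tokens, then
    predict a click point as the attention-weighted patch center."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))  # learned action query
        self.scale = hidden_dim ** -0.5

    def forward(self, patch_states: torch.Tensor, patch_centers: torch.Tensor):
        # patch_states:  (num_patches, hidden_dim) hidden states of visual tokens
        # patch_centers: (num_patches, 2) normalized (x, y) center of each patch
        scores = patch_states @ self.query * self.scale         # (num_patches,)
        attn = torch.softmax(scores, dim=0)                     # attention over patches
        click_xy = (attn.unsqueeze(-1) * patch_centers).sum(0)  # expected click location
        return click_xy, attn

head = AttentionActionHead(hidden_dim=1536)
states = torch.randn(1024, 1536)  # e.g. hidden states for a 32x32 patch grid
ys, xs = torch.meshgrid(torch.linspace(0, 1, 32), torch.linspace(0, 1, 32), indexing="ij")
centers = torch.stack([xs.flatten(), ys.flatten()], dim=-1)
click, attn = head(states, centers)
print(click)  # predicted normalized (x, y) click coordinate
```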
## Dreamer-7B

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, English · **Publisher:** osunlp · **Downloads:** 62 · **Likes:** 3

WebDreamer is a planning framework that enables efficient and effective planning for real-world web-agent tasks.
## Gemma-3-27B-It-GGUF

**Tags:** Text-to-Image · **Publisher:** Mungert · **Downloads:** 4,034 · **Likes:** 6

A GGUF-quantized version of Gemma 3 with 27B parameters, supporting image-text interaction tasks.
## STEVE-R1-7B-SFT-i1-GGUF

**License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Publisher:** mradermacher · **Downloads:** 394 · **Likes:** 0

A weighted/imatrix quantized version of the Fanbin/STEVE-R1-7B-SFT model, suitable for resource-constrained environments.
## Gemma-3-4B-It-GGUF

**Tags:** Image-to-Text · **Publisher:** ggml-org · **Downloads:** 9,023 · **Likes:** 25

Gemma 3 is a lightweight open multimodal model from Google that accepts text and image inputs and produces text outputs, with a 128K-token context window and support for over 140 languages.
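Both Gemma 3 GGUF entries above are quantized checkpoints meant for llama.cpp-style runtimes rather than transformers. Below is a minimal text-only sketch using the llama-cpp-python bindings; the model filename is a placeholder for whichever quant you downloaded, and image input additionally requires the model's mmproj projector file.

```python
from llama_cpp import Llama

# Path to a downloaded GGUF quant; the exact filename depends on the
# quantization level you chose (Q4_K_M here is an assumption).
llm = Llama(
    model_path="gemma-3-4b-it-Q4_K_M.gguf",
    n_ctx=8192,        # context to allocate (the model supports up to 128K)
    n_gpu_layers=-1,   # offload all layers if built with GPU support
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what GGUF quantization is."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```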
## Q-SiT

**License:** MIT · **Tags:** Image-to-Text, Transformers · **Publisher:** zhangzicheng · **Downloads:** 79 · **Likes:** 0

Q-SiT Mini is a lightweight image quality assessment and dialogue model focused on image quality analysis and scoring.
## Llama-3.2-11B-Vision-Electrical-Components-Instruct

**License:** MIT · **Tags:** Image-to-Text, English · **Publisher:** ankitelastiq · **Downloads:** 22 · **Likes:** 1

Llama 3.2 11B Vision Instruct is a multimodal model combining vision and language, supporting image-to-text tasks.
## LLaVA-NeXT-Video-7B-hf

**Tags:** Video-to-Text, English · **Publisher:** FriendliAI · **Downloads:** 30 · **Likes:** 0

LLaVA-NeXT-Video-7B-hf is a video multimodal model that processes video and text inputs to generate text outputs.
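For reference, LLaVA-NeXT-Video checkpoints can be driven through the transformers video classes. The sketch below assumes the llava-hf mirror of this checkpoint and feeds random frames in place of a properly sampled clip.

```python
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed mirror of this checkpoint
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# A real pipeline would sample frames from a video file;
# random frames keep the sketch self-contained.
frames = np.random.randint(0, 255, size=(8, 336, 336, 3), dtype=np.uint8)

prompt = "USER: <video>\nDescribe what happens in this video. ASSISTANT:"
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```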
## Libra-LLaVA-Med-v1.5-Mistral-7B

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Publisher:** X-iZhang · **Downloads:** 180 · **Likes:** 1

LLaVA-Med is an open-source large vision-language model tailored to biomedical applications. Built on the LLaVA framework, it is enhanced through curriculum learning and fine-tuned for open-ended biomedical question answering.
## Florence-2-Base-Castollux-v0.4

**Tags:** Image-to-Text, Transformers, English · **Publisher:** PJMixers-Images · **Downloads:** 23 · **Likes:** 1

An image captioning model fine-tuned from microsoft/Florence-2-base, focused on improving caption quality and formatting.
## LLaVA-Llama3

**Tags:** Image-to-Text · **Publisher:** chatpig · **Downloads:** 360 · **Likes:** 1

LLaVA-Llama3 is a multimodal model based on Llama 3 that supports joint processing of images and text.
## UI-TARS-7B-DPO

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** ByteDance-Seed · **Downloads:** 38.74k · **Likes:** 206

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## UI-TARS-2B-SFT

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** bytedance-research · **Downloads:** 5,792 · **Likes:** 19

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## UI-TARS-2B-SFT

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** ByteDance-Seed · **Downloads:** 5,553 · **Likes:** 19

UI-TARS is a next-generation native GUI agent model, designed to interact seamlessly with graphical user interfaces through human-like perception, reasoning, and action.
## DeQA-Score-Mix3

**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Publisher:** zhiyuanyou · **Downloads:** 4,177 · **Likes:** 2

DeQA-Score-Mix3 is a no-reference image quality assessment model fine-tuned from the MAGAer13/mplug-owl2-llama2-7b base model, with strong performance across multiple datasets.
## ColQwen2-7B-v1.0

**Tags:** Text-to-Image, English · **Publisher:** yydxlv · **Downloads:** 25 · **Likes:** 1

A visual retrieval model based on Qwen2-VL-7B-Instruct and the ColBERT strategy, supporting multi-vector representations of text and images.
## VideoChat-TPO

**License:** MIT · **Tags:** Text-to-Video, Transformers · **Publisher:** OpenGVLab · **Downloads:** 18 · **Likes:** 5

A multimodal large language model developed from the paper "Task Preference Optimization: Improving Multimodal Large Language Models through Visual Task Alignment".
## Olympus

**License:** Apache-2.0 · **Tags:** Text-to-Image, Transformers, English · **Publisher:** Yuanze · **Downloads:** 231 · **Likes:** 2

Olympus is a universal task-routing system for computer vision that handles 20 different visual tasks, achieving efficient multi-task processing through its task-routing mechanism.
## LLaVA-Critic-7B-hf

**Tags:** Text-to-Image, Transformers · **Publisher:** FuryMartin · **Downloads:** 21 · **Likes:** 1

A transformers-compatible vision-language model with image understanding and text generation capabilities.
## BLIP Radiology Model

**Tags:** Image-to-Text, Transformers · **Publisher:** daliavanilla · **Downloads:** 16 · **Likes:** 0

BLIP is a Transformer-based image captioning model that generates natural-language descriptions for input images.
## ViT-GPT2 Image Captioning Model

**Tags:** Image-to-Text, Transformers · **Publisher:** motheecreator · **Downloads:** 142 · **Likes:** 0

An image captioning model based on the ViT-GPT2 encoder-decoder architecture that converts input images into descriptive text.
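The two captioning entries above (BLIP-based and ViT-GPT2) follow the same encode-image, decode-text pattern. A minimal sketch of that pattern, using the widely used nlpconnect ViT-GPT2 checkpoint as a stand-in for the listed models:

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

# Well-known ViT-GPT2 captioning checkpoint, used here as a stand-in;
# the models listed above follow the same encoder-decoder pattern.
model_id = "nlpconnect/vit-gpt2-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_id)
processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # any local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    output_ids = model.generate(pixel_values, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```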
## ColQwen2-v0.1

**License:** Apache-2.0 · **Tags:** Text-to-Image, English · **Publisher:** vidore · **Downloads:** 21.25k · **Likes:** 170

A visual retrieval model based on Qwen2-VL-2B-Instruct and the ColBERT strategy, able to index documents efficiently from their visual features.
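Both ColQwen2 entries rely on ColBERT-style late interaction: queries and pages are embedded as bags of vectors, and relevance is the sum, over query vectors, of each one's best match among document vectors. A minimal, model-agnostic sketch of that MaxSim scoring, with random embeddings standing in for real ColQwen2 outputs:

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction: for each query vector, take its
    maximum similarity over all document vectors, then sum."""
    # query_emb: (num_query_tokens, dim); doc_emb: (num_doc_tokens, dim)
    sim = query_emb @ doc_emb.T          # (num_query_tokens, num_doc_tokens)
    return sim.max(dim=1).values.sum()   # best document match per query token

# Random multi-vector embeddings stand in for real model outputs.
query = F.normalize(torch.randn(20, 128), dim=-1)
pages = [F.normalize(torch.randn(700, 128), dim=-1) for _ in range(3)]

scores = torch.stack([maxsim_score(query, p) for p in pages])
print("best page:", scores.argmax().item())
```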
## CogFlorence-2.2-Large

**License:** MIT · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** thwri · **Downloads:** 20.64k · **Likes:** 33

A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with captions generated by THUDM/cogvlm2-llama3-chat-19B; suited to image-to-text tasks.
## Lumina-mGPT-7B-512

**Tags:** Text-to-Image · **Publisher:** Alpha-VLLM · **Downloads:** 1,185 · **Likes:** 4

Lumina-mGPT is a family of multimodal autoregressive models that excel at a range of vision-language tasks, particularly generating flexible, photorealistic images from text descriptions.
## CogFlorence-2-Large-Freeze

**License:** MIT · **Tags:** Image-to-Text, Transformers, Multilingual · **Publisher:** thwri · **Downloads:** 419 · **Likes:** 14

A fine-tuned version of microsoft/Florence-2-large, trained on a 38,000-image subset of the Ejafa/ye-pop dataset with CogVLM2-generated captions, focused on image-to-text tasks.
## ViT-Base-Patch16-224-DistilGPT2

**License:** Apache-2.0 · **Tags:** Image-to-Text, Transformers · **Publisher:** tarekziade · **Downloads:** 17 · **Likes:** 0

DistilViT is an image captioning model pairing a Vision Transformer (ViT) encoder with a distilled GPT-2 decoder to convert images into textual descriptions.
## TiC-CLIP-Bestpool-Sequential

**License:** Other · **Tags:** Text-to-Image · **Publisher:** apple · **Downloads:** 280 · **Likes:** 0

TiC-CLIP is a vision-language model trained on the TiC-DataComp-Yearly dataset, using continual-learning strategies to keep the model in sync with the latest data.
## TiC-CLIP-Bestpool-Oracle

**License:** Other · **Tags:** Text-to-Image · **Publisher:** apple · **Downloads:** 44 · **Likes:** 0

TiC-CLIP is an improved OpenCLIP-based vision-language model focused on time-continual learning, with training data spanning 2014 to 2022.
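Like other CLIP variants, the TiC-CLIP models score text-image pairs by embedding both into a shared space and comparing similarities. A generic sketch with the original OpenAI CLIP checkpoint as a stand-in (the TiC-CLIP weights themselves are OpenCLIP-format and would be loaded through the open_clip library instead):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Generic CLIP checkpoint used as a stand-in for the listed variants.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg").convert("RGB")
texts = ["a photo of a cat", "a photo of a dog", "a screenshot of a website"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds cosine similarities scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```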
## LLaVA-Phi-3-mini-4k-instruct

**License:** MIT · **Tags:** Image-to-Text, Transformers · **Publisher:** MBZUAI · **Downloads:** 550 · **Likes:** 22

A vision-language model that combines the Phi-3-mini-3.8B large language model with LLaVA v1.5, providing advanced vision-language understanding capabilities.
## LLaVA-Phi-3-mini-GGUF

**Tags:** Image-to-Text · **Publisher:** xtuner · **Downloads:** 1,676 · **Likes:** 133

LLaVA-Phi-3-mini is a LLaVA model fine-tuned from Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.
## VLRM-BLIP2-OPT-2.7B

**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Publisher:** sashakunitsyn · **Downloads:** 398 · **Likes:** 17

A BLIP-2 OPT-2.7B model fine-tuned with reinforcement learning, capable of generating long, detailed image descriptions.
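BLIP-2 checkpoints of this family generate captions through the standard transformers API. A minimal sketch with the base Salesforce checkpoint; the RL fine-tune above shares the architecture, so swapping in its repo id should load the same way (an assumption, not verified here).

```python
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Base BLIP-2 OPT-2.7B checkpoint the fine-tune above builds on.
model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)

generated = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```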
## BLIP-Finetuned-Fashion

**License:** BSD-3-Clause · **Tags:** Text-to-Image, Transformers · **Publisher:** Ornelas · **Downloads:** 2,281 · **Likes:** 0

A visual question answering model fine-tuned from Salesforce/blip-vqa-base, specialized for the fashion domain.
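Visual question answering with a BLIP VQA model takes an image plus a question and decodes a short answer. A sketch with the base checkpoint this fashion model was fine-tuned from; the image filename and question are placeholders.

```python
import torch
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

# The base checkpoint named in the entry above; the fine-tune
# loads identically via its own repo id.
model_id = "Salesforce/blip-vqa-base"
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForQuestionAnswering.from_pretrained(model_id)

image = Image.open("outfit.jpg").convert("RGB")
question = "What color is the jacket?"

inputs = processor(images=image, text=question, return_tensors="pt")
with torch.no_grad():
    answer_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```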
## Thai-TrOCR-ThaiGov-V2

**Tags:** Image-to-Text, Transformers, Other · **Publisher:** kkatiz · **Downloads:** 339 · **Likes:** 13

A Thai handwriting recognition model built on a vision encoder-decoder architecture, suitable for a range of Thai OCR tasks.
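TrOCR-style models transcribe one cropped text line at a time through the vision encoder-decoder API. A sketch using Microsoft's English handwriting checkpoint as a stand-in; a Thai model like the one above would be loaded the same way by its repo id.

```python
import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# English handwriting checkpoint used as a stand-in for the Thai model.
model_id = "microsoft/trocr-base-handwritten"
processor = TrOCRProcessor.from_pretrained(model_id)
model = VisionEncoderDecoderModel.from_pretrained(model_id)

# TrOCR expects a cropped image of a single text line.
line_image = Image.open("line.png").convert("RGB")
pixel_values = processor(images=line_image, return_tensors="pt").pixel_values

with torch.no_grad():
    ids = model.generate(pixel_values, max_new_tokens=64)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```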
## InfiMM-HD

**Tags:** Image-to-Text, Transformers, English · **Publisher:** Infi-MM · **Downloads:** 17 · **Likes:** 27

InfiMM-HD is a high-resolution multimodal model that understands and generates content combining images and text.
## TeCoA2-CLIP

**License:** MIT · **Tags:** Text-to-Image · **Publisher:** chs20 · **Downloads:** 53 · **Likes:** 1

A vision-language model initialized from OpenAI CLIP and adversarially fine-tuned on ImageNet for improved robustness.
## FARE4-CLIP

**License:** MIT · **Tags:** Text-to-Image · **Publisher:** chs20 · **Downloads:** 45 · **Likes:** 1

A vision-language model initialized from OpenAI CLIP, with robustness improved through unsupervised adversarial fine-tuning.
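The two robust CLIP variants above come from adversarial fine-tuning: an inner attack perturbs training images within a small L∞ ball, and the encoder is then updated to keep its embeddings stable under that perturbation. Below is a schematic PGD step in that spirit; a toy encoder replaces CLIP, and the loss and hyperparameters are illustrative, not those used for TeCoA or FARE.

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a CLIP image encoder.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
opt = torch.optim.SGD(encoder.parameters(), lr=1e-3)

images = torch.rand(8, 3, 32, 32)
eps, step, n_steps = 4 / 255, 1 / 255, 3  # illustrative PGD budget

with torch.no_grad():
    clean_emb = encoder(images)           # embeddings to stay close to

# Inner loop: PGD finds a perturbation that maximally distorts the embeddings.
delta = torch.zeros_like(images, requires_grad=True)
for _ in range(n_steps):
    adv_emb = encoder(images + delta)
    loss = F.mse_loss(adv_emb, clean_emb)
    loss.backward()
    with torch.no_grad():
        delta += step * delta.grad.sign()  # ascend the embedding-distance loss
        delta.clamp_(-eps, eps)            # stay inside the L-inf ball
    delta.grad.zero_()

# Outer step: train the encoder to resist the perturbation just found.
opt.zero_grad()
robust_loss = F.mse_loss(encoder(images + delta.detach()), clean_emb)
robust_loss.backward()
opt.step()
```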
## InternLM-XComposer2-7B-4bit

**License:** Other · **Tags:** Image-to-Text, Transformers · **Publisher:** internlm · **Downloads:** 74 · **Likes:** 10

InternLM-XComposer2 is a vision-language large model (VLLM) based on InternLM2, featuring advanced image-text understanding and composition capabilities.
## InternLM-XComposer2-VL-7B-4bit

**License:** Other · **Tags:** Image-to-Text, Transformers · **Publisher:** internlm · **Downloads:** 1,635 · **Likes:** 27

A vision-language large model based on InternLM2, with outstanding image-text understanding and composition capabilities.
## Quilt-LLaVA-v1.5-7B

**Tags:** Text-to-Image, Transformers · **Publisher:** wisdomik · **Downloads:** 618 · **Likes:** 6

Quilt-LLaVA is an open-source chatbot fine-tuned from LLaMA/Vicuna on multimodal instruction-following data generated with GPT from histopathology educational videos.
## MoE-LLaVA-Qwen-1.8B-4e

**License:** Apache-2.0 · **Tags:** Text-to-Image, Transformers · **Publisher:** LanguageBind · **Downloads:** 176 · **Likes:** 14

MoE-LLaVA is a large vision-language model built on a Mixture-of-Experts architecture, achieving efficient multimodal learning by sparsely activating only a subset of its parameters.
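The "sparse activation" in the MoE-LLaVA entry refers to top-k expert routing: a router scores all experts per token, but only the k best actually run. A minimal, generic sketch of such a layer (not MoE-LLaVA's implementation; names and sizes are invented):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: a router picks k experts per token,
    and only those experts are evaluated."""

    def __init__(self, dim: int, n_experts: int = 4, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)    # routing probabilities
        weights, idx = gates.topk(self.k, dim=-1)    # keep top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                        # tokens routed to expert e
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

layer = TopKMoE(dim=64)
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```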